Exploratory Analysis

Load data.

df <- read_csv("e1-anonymous.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   workerId = col_character(),
##   condition = col_character(),
##   gender = col_character(),
##   age = col_character(),
##   education = col_character(),
##   chart_use = col_character(),
##   problems = col_character(),
##   interactions = col_character(),
##   trial = col_character(),
##   userInput = col_character()
## )
## See spec(...) for full column specifications.
head(df)
## # A tibble: 6 x 34
##   workerId batch bonus condition duration n_data_conds n_trials gender age  
##   <chr>    <dbl> <dbl> <chr>        <dbl>        <dbl>    <dbl> <chr>  <chr>
## 1 b888e39c    29  0.25 text          567.           16       18 woman  55-64
## 2 b888e39c    29  0.25 text          567.           16       18 woman  55-64
## 3 b888e39c    29  0.25 text          567.           16       18 woman  55-64
## 4 b888e39c    29  0.25 text          567.           16       18 woman  55-64
## 5 b888e39c    29  0.25 text          567.           16       18 woman  55-64
## 6 b888e39c    29  0.25 text          567.           16       18 woman  55-64
## # … with 25 more variables: education <chr>, chart_use <chr>, problems <chr>,
## #   abs_err <dbl>, causal_support <dbl>, count_GT <dbl>, count_GnT <dbl>,
## #   count_nGT <dbl>, count_nGnT <dbl>, ground_truth <dbl>, interactions <chr>,
## #   n <dbl>, p_treat <dbl>, payoff <dbl>, q_idx <dbl>, response_A <dbl>,
## #   response_B <dbl>, total_GT <dbl>, total_GnT <dbl>, total_nGT <dbl>,
## #   total_nGnT <dbl>, trial <chr>, trial_dur <dbl>, trial_idx <dbl>,
## #   userInput <chr>

Drop practice trial.

df <- df %>% filter(trial != "practice")

Exclusions

Let’s exclude workers who miss either the trial where causal support is largest or the trial where it is smallest. We define a miss as an absolute error greater than 50%.

exclude_df <- df %>%
  group_by(workerId) %>%
  summarise(
    # trial_idx == -1 flags the trial with the largest causal support
    max_trial_idx = which(trial_idx == -1)[1],
    max_trial_gt = ground_truth[[max_trial_idx]],
    max_trial_err = abs_err[[max_trial_idx]],
    # trial_idx == -2 flags the trial with the smallest causal support
    min_trial_idx = which(trial_idx == -2)[1],
    min_trial_gt = ground_truth[[min_trial_idx]],
    min_trial_err = abs_err[[min_trial_idx]],
    # a miss is an absolute error greater than 50%
    max_exclude = max_trial_err > 0.5,
    min_exclude = min_trial_err > 0.5,
    exclude = max_trial_err > 0.5 | min_trial_err > 0.5
  )
## `summarise()` ungrouping output (override with `.groups` argument)
head(exclude_df)
## # A tibble: 6 x 10
##   workerId max_trial_idx max_trial_gt max_trial_err min_trial_idx min_trial_gt
##   <chr>            <int>        <dbl>         <dbl>         <int>        <dbl>
## 1 0030b402            13            1          0.3              7      0.00150
## 2 006327b8             7            1          0.3             13      0.00150
## 3 0116d604             7            1          0.4             13      0.00150
## 4 0306e6e7             7            1          0.2             13      0.00150
## 5 039a9f69             7            1          0.1             13      0.00150
## 6 03fd4b52             7            1          0.55            13      0.00150
## # … with 4 more variables: min_trial_err <dbl>, max_exclude <lgl>,
## #   min_exclude <lgl>, exclude <lgl>

What proportion of workers are missing each attention check? How many workers are we leaving out with our exclusion criteria?

sum(exclude_df$max_exclude) / length(exclude_df$exclude)
## [1] 0.2268431
sum(exclude_df$min_exclude) / length(exclude_df$exclude)
## [1] 0.3572779
sum(exclude_df$exclude) / length(exclude_df$exclude)
## [1] 0.4820416

Apply the exclusion criteria. Note that we filter based only on misses of the maximum causal support trial (max_exclude); excluding workers who missed either attention check would remove nearly half of our sample.

df <- exclude_df %>%
  select(workerId, max_exclude) %>%
  rename(exclude = max_exclude) %>%
  full_join(df, by = "workerId") %>%
  filter(!exclude) %>%
  select(-exclude)

How many participants per condition after exclusions? (target sample size was 80 per condition)

df %>%
  group_by(condition) %>%
  summarise(
    n = length(unique(workerId))
  )
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 2
##   condition     n
##   <chr>     <int>
## 1 aggbars      81
## 2 bars         86
## 3 filtbars     81
## 4 icons        81
## 5 text         80

Performance

Let’s start by looking at the overall pattern of absolute error. Then we’ll look at the pattern of responses vs the ground truth.

Absolute error

Here’s the absolute error per trial, separated by sample size and visualization condition.

df %>% 
  ggplot(aes(x = condition, y = abs_err)) +
    stat_eye() +
    geom_point(position = position_jitter(), alpha = 0.5) +
    theme_bw() +
    facet_grid(n ~ .)

Let’s also look at the average absolute error per worker and level of sample size, separated by visualization condition.

df %>% 
  group_by(workerId, condition, n) %>%
  summarise(
    avg_abs_err = mean(abs_err)
  ) %>%
  ungroup() %>%
  ggplot(aes(x = condition, y = avg_abs_err)) +
    stat_eye() +
    geom_point(position = position_jitter(), alpha = 0.5) +
    theme_bw() +
    facet_grid(n ~ .)
## `summarise()` regrouping output by 'workerId', 'condition' (override with `.groups` argument)

Absolute error seems to get higher as sample size increases, consistent with our pilot. Absolute error seems highest with filtbars, unsurprisingly. It looks like icons may be especially helpful relative to the other conditions at high sample sizes.

Perhaps more striking, absolute error is very high. This is probably a realistic reflection of the task difficulty for causal inferences.

Responses vs ground truth

Let’s see if we can make better sense of the results by looking at raw responses vs the ground truth. The red line represents perfect performance.

df %>% 
  ggplot(aes(x = ground_truth, y = response_A)) +
    geom_point(alpha = 0.4) +
    geom_abline(slope = 100, intercept = 0, color = "red") +
    theme_bw() +
    facet_grid(n ~ condition)

We expect to see an inverse S-shaped pattern here (i.e., a pattern that is linear in log odds with a slope less than one, at least above ground_truth = 0.5). We see this pattern, but the responses are noisy.
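To make the expected pattern concrete, here’s a hypothetical illustration (not part of the analysis; the slope of 0.5 is made up): a response that is linear in log odds with a slope less than one produces exactly this inverse S-shape, compressing responses toward 50%.

```r
# Hypothetical illustration: responses that are linear in log odds with a
# slope below one form an inverse S-shape on the probability scale.
gt <- seq(0.01, 0.99, by = 0.01)    # ground truth probabilities
resp <- plogis(0.5 * qlogis(gt))    # slope of 0.5 in log odds units
# Above gt = 0.5 responses fall below the identity line (underestimation);
# below gt = 0.5 they fall above it (overestimation).
```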

We expect a linear model to fit these data well in log odds units, so let’s sanity-check this intuition by looking at causal support (i.e., logit(ground_truth)) vs the log ratio of responses.

df <- df %>% 
  mutate(
    # clamp responses away from 0 and 100 to avoid infinite log odds
    response_A = if_else(
      response_A > 99.5, 99.5,
      if_else(
        response_A < 0.5, 0.5,
        as.numeric(response_A))),
    response_B = if_else(
      response_B > 99.5, 99.5,
      if_else(
        response_B < 0.5, 0.5,
        as.numeric(response_B))),
    # calculate log response ratio
    lrr = log(response_A / 100) - log(response_B / 100)
  )
df %>%
  ggplot(aes(x = causal_support, y = lrr)) +
    geom_point(alpha = 0.4) +
    geom_abline(slope = 1, intercept = 0, color = "red") +
    coord_cartesian(
      xlim = c(-20, 35),
      ylim = c(-20, 35)
    ) +
    theme_bw() +
    facet_grid(n ~ condition)

This doesn’t look as informative as I’d like because the sampling of the ground truth is so heavily influenced by sample size. Let’s look at each level of sample size on its own axis scale.

df %>% 
  filter(n == 100) %>%
  ggplot(aes(x = causal_support, y = lrr)) +
    geom_point(alpha = 0.4) +
    geom_abline(slope = 1, intercept = 0, color = "red") +
    coord_cartesian(
      xlim = c(-10, 10),
      ylim = c(-10, 10)
    ) +
    theme_bw() +
    facet_grid(n ~ condition)

df %>% 
  filter(n == 500) %>%
  ggplot(aes(x = causal_support, y = lrr)) +
    geom_point(alpha = 0.4) +
    geom_abline(slope = 1, intercept = 0, color = "red") +
    coord_cartesian(
      xlim = c(-20, 20),
      ylim = c(-20, 20)
    ) +
    theme_bw() +
    facet_grid(n ~ condition)

df %>% 
  filter(n == 1000) %>%
  ggplot(aes(x = causal_support, y = lrr)) +
    geom_point(alpha = 0.4) +
    geom_abline(slope = 1, intercept = 0, color = "red") +
    coord_cartesian(
      xlim = c(-30, 30),
      ylim = c(-30, 30)
    ) +
    theme_bw() +
    facet_grid(n ~ condition)

df %>% 
  filter(n == 1500) %>%
  ggplot(aes(x = causal_support, y = lrr)) +
    geom_point(alpha = 0.4) +
    geom_abline(slope = 1, intercept = 0, color = "red") +
    coord_cartesian(
      xlim = c(-40, 40),
      ylim = c(-40, 40)
    ) +
    theme_bw() +
    facet_grid(n ~ condition)

It seems clear that slopes will be less than 1 in logit-logit coordinate space, at least for all but the smallest sample size, and that slopes are different on either side of causal_support = 0. We’ll want to make sure our model can account for this pattern.

One thing that stands out on these charts is that the ability to discriminate signal decreases at higher sample size. Although we expect to see people underestimate sample size, the logit-logit slope flattening out as the weight of evidence increases is an interesting pattern.

It’s hard to see other patterns without a model.
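One way a model could account for the different slopes on either side of causal_support = 0 is a broken-stick regression with hinge terms that split the slope at zero. A minimal sketch on simulated data (the variable names, slopes, and noise level here are invented for illustration; cs stands in for causal_support and y for lrr):

```r
# Broken-stick sketch: fit separate logit-logit slopes above and below zero.
# Simulated data only -- the true slopes are 0.4 (above 0) and 0.7 (below 0).
set.seed(1)
cs <- runif(200, -20, 35)
y  <- ifelse(cs > 0, 0.4 * cs, 0.7 * cs) + rnorm(200, sd = 2)
# pmax/pmin create hinge terms, so each side of zero gets its own slope
fit <- lm(y ~ pmax(cs, 0) + pmin(cs, 0))
coef(fit)
```

The full model for the experiment would of course also need terms for condition and sample size; this just shows the piecewise-slope idea in isolation.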

Response times

Let’s see a histogram of response times per trial. These are measured in seconds.

df %>%
  filter(trial_dur >= 0) %>%
  ggplot(aes(x = trial_dur)) +
    geom_histogram() +
    theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As expected, this looks like a heavy-tailed, roughly power-law distribution.
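One quick way to probe the heavy-tail claim (a sketch on simulated durations, since distinguishing a true power law from, say, a log-normal would take more care): for right-skewed, heavy-tailed data the mean sits well above the median.

```r
# Heavy-tail sanity check on simulated durations (illustrative only):
# log-normal draws stand in for trial_dur in seconds.
set.seed(2)
dur <- exp(rnorm(1000, mean = 3, sd = 0.8))
mean(dur) > median(dur)  # right skew pulls the mean above the median
```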

Let’s separate this by condition to see if icons or text is taking systematically longer.

df %>%
  filter(trial_dur >= 0) %>%
  ggplot(aes(x = trial_dur)) +
    geom_histogram() +
    theme_bw() +
    facet_grid(condition ~ .)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

These distributions look similar. People seem to work most quickly in the text and icons conditions and more slowly with the bar chart variants.

What about the duration of the experiment per participant?

df %>%
  filter(duration >= 0) %>%
  filter(trial == 1) %>%
  ggplot(aes(x = duration)) +
    geom_histogram() +
    theme_bw() +
    facet_grid(condition ~ .)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Outliers make it hard to see anything in these data. Honestly, the response time data are not too interesting, but they’re worth checking.

Demographics

Let’s check out our demographic variables in aggregate just to get a sense of our sample composition.

We collected gender as a free response, so we’ll need to do a little lightweight text processing to generate a histogram. A few participants seem to have put their age in the gender box, so we are missing data for these folks. This categorization is not intended to be normative or prescriptive; it is just for generating an approximate overview of gender balance in our sample.

df %>%
  filter(trial == 1) %>%
  rowwise() %>%
  mutate(
    gender = tolower(as.character(gender)),
    gender = case_when(
      grepl("woman", gender, fixed = TRUE) | grepl("female", gender, fixed = TRUE)       ~ "woman",
      (grepl("man", gender, fixed = TRUE) | grepl("male", gender, fixed = TRUE)) & 
        ! (grepl("woman", gender, fixed = TRUE) | grepl("female", gender, fixed = TRUE)) ~ "man",
      TRUE                                                                               ~ "other")
  ) %>%
  ggplot(aes(y = gender)) +
    geom_bar() + 
    theme_bw()

Our sample is pretty gender biased, consistent with our pilot data.

Let’s also check out age.

df %>%
  filter(trial == 1) %>%
  ggplot(aes(y = age)) +
    geom_bar() + 
    theme_bw()

Our sample skews toward younger people.

What about education?

df %>%
  filter(trial == 1) %>%
  ggplot(aes(y = education)) +
    geom_bar() + 
    theme_bw()

Lots of college-educated MTurk workers. This is probably a good thing since we are studying how people do data analysis, and analysts tend to be college educated.

Last, let’s look at chart use.

df %>%
  filter(trial == 1) %>%
  ggplot(aes(y = chart_use)) +
    geom_bar() + 
    theme_bw()

Covariate effects on performance

Education and chart use are the only demographic variables we collected that I would expect to have an impact on performance.

Let’s check out the relationships of those variables with average absolute error.

df %>% 
  group_by(education) %>%
  summarise(
    count = length(unique(workerId)),
    workerId = list(workerId),
    abs_err = list(abs_err)
  ) %>%
  unnest(cols = c("workerId", "abs_err")) %>%
  group_by(workerId, education) %>%
  summarise(
    avg_abs_err = mean(abs_err),
    count = unique(count)
  ) %>%
  ungroup() %>%
  filter(count > 10) %>%
  ggplot(aes(x = avg_abs_err, y = education)) +
    stat_eyeh() +
    geom_point(position = position_jitter(), alpha = 0.5) +
    theme_bw()
## `summarise()` ungrouping output (override with `.groups` argument)
## `summarise()` regrouping output by 'workerId' (override with `.groups` argument)

df %>% 
  group_by(workerId, chart_use) %>%
  summarise(
    avg_abs_err = mean(abs_err)
  ) %>%
  ungroup() %>%
  ggplot(aes(x = avg_abs_err, y = chart_use)) +
    stat_eyeh() +
    geom_point(position = position_jitter(), alpha = 0.5) +
    theme_bw()
## `summarise()` regrouping output by 'workerId' (override with `.groups` argument)

These conditional distributions are mostly overlapping. The only factors that seem to make much of a difference are having a master’s degree and daily chart use. Surprisingly, both of these groups show larger average absolute error, the opposite of what we might expect.